Department of Computer Science
Duke University

Under the auspices of this grant, we have developed a hierarchical, combinatorial-Markov
approach for solving large reliability/availability/performance models of systems. The
approach allows the modeler to combine the strengths of combinatorial models and Markov
models to obtain a cost-effective solution to large models. The approach has been used in
two Ph.D. dissertations: one by LTCL Jim Blake, who studied the performability of
multiprocessor interconnection networks, and one by Malathi Veeraraghavan, who modeled many
fault-tolerant systems, including Boeing’s IAPSA and Draper Laboratories’ AIPS. Jim Blake has
joined the Army AIRMICS Laboratory and Malathi has joined AT&T Bell Laboratories in Columbus.

Other methods of dealing with complex system models that we have explored include
automated methods of Markov model generation using stochastic Petri nets. Efficient methods
of solving stochastic Petri net models are being investigated by Gianfranco Ciardo in his
dissertation. His work also involves applying SPN techniques to the performance analysis of
concurrent programs.

Much of our research deals with the transient solution of large and stiff Markov and Markov
reward models. We have developed a decomposition technique for the transient analysis of stiff
Markov chains jointly with Dr. A. Bobbio of Institute Ferraris, Torino, Italy. A description
of the technique was published in the IEEE Transactions on Computers (September 1986)
and is receiving wide attention. We have carried out a thorough comparison of the transient
analysis methods for Markov models within the scope of the Ph.D. thesis by Andrew Reibman.
This work has received attention in the applied probability and operations research communities.
Andrew has accepted a position at AT&T Bell Laboratories in order to further utilize this
research in solving reliability models of communication systems.
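
To make the notion of a transient solution concrete, the following is a minimal sketch that computes the state probabilities of a small continuous-time Markov chain at time t by uniformization (randomization). Uniformization is a standard textbook method and is used here only as an illustration; it is not the decomposition technique mentioned above. The function name, the two-state failure/repair model, and the rate values are assumptions chosen for the example.

```python
import math


def transient_probabilities(Q, p0, t, tol=1e-12, max_terms=100000):
    """Return state probabilities p(t) of a CTMC with generator Q (rows sum to 0).

    Uses uniformization: p(t) = sum_k Poisson(k; q*t) * p0 * P^k, with P = I + Q/q.
    This naive form is only suitable for moderate q*t; for stiff models q*t is
    large, exp(-q*t) underflows, and many terms are needed.
    """
    n = len(Q)
    q = max(-Q[i][i] for i in range(n))  # uniformization rate
    if q == 0.0 or t == 0.0:
        return list(p0)
    # Transition matrix of the uniformized discrete-time chain.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    weight = math.exp(-q * t)  # Poisson probability of k = 0 jumps
    v = list(p0)               # p0 * P^k, starting with k = 0
    result = [weight * x for x in v]
    accumulated = weight
    k = 0
    while 1.0 - accumulated > tol and k < max_terms:
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k
        accumulated += weight
        for j in range(n):
            result[j] += weight * v[j]
    return result


if __name__ == "__main__":
    # Hypothetical two-state model: state 0 = up, state 1 = down.
    lam, mu = 1.0, 9.0  # assumed failure and repair rates
    Q = [[-lam, lam], [mu, -mu]]
    print(transient_probabilities(Q, [1.0, 0.0], 0.5))
```

For this two-state model the result can be checked against the closed form mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu) t) for the probability of being up at time t.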

Our work on Markov reward models is important not only because we have developed an
efficient algorithm for numerical solution but also because of the large variety of applications
we are exploring. The research on Markov reward models involves interdisciplinary (with
Dr. Kulkarni of the Operational Research Curriculum at the University of North Carolina) and
international (Dr. Francois Baccelli of INRIA, France and Dr. Raymond Marie of IRISA,
Rennes, France) collaborations. The applications have addressed the effectiveness evaluation of
10X 16 multiprocessor systems with various interconnection schemes, the response-time distribution
in an M/M/1 queue with the processor-sharing discipline, the distribution of time-averages in
queueing systems, and response-time distributions of tasks in a system subject to failure and
repair. Three Ph.D. dissertations have been completed in this area: one by Victor Nicola, who has
joined IBM Yorktown Research Center; a second by Roger Smith, who has joined Yale University; and
a third by Phil Chimento, who will remain at IBM Research Triangle Park.

Another important area of research is the analysis of the coverage of a fault-tolerant
system, that is, the probability that the system can recover from a fault. We have studied a
variety of models, from simple phase-type models to very complex stochastic Petri net models, and
have investigated solution techniques for each model type. Our methodology allows consideration
of external events that can interfere with recovery, such as a hard limit on recovery time
or the occurrence of a second near-coincident fault. We discovered that a policy of attempting
transient recovery upon detection of an error (as opposed to automatically reconfiguring the
affected component out of the system) may actually increase the unreliability of the system.
This result holds if the error detectability is not nearly perfect, so that the risk of producing
an undetectable error (if the transient error is present) is greater than the benefit gained by
not discarding the component.
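
The trade-off described above can be illustrated with a deliberately simplified, per-event calculation. This is a toy sketch under stated assumptions, not the phase-type or stochastic Petri net coverage models studied under the grant. All names and numbers are hypothetical: p_transient is the chance that a detected error stems from a fault that has vanished by the time recovery is attempted, detectability is the chance that a recurring error is caught, and r_exhaust is the added failure risk from discarding a component.

```python
def reconfigure_risk(r_exhaust):
    """Failure probability per detected error if the component is discarded at once."""
    return r_exhaust


def retry_risk(p_transient, detectability, r_exhaust):
    """Failure probability per detected error if transient recovery is attempted first.

    Assumed behavior: if the fault has vanished, the component is kept at no risk.
    If it persists, the recurring error is either detected (component is then
    discarded, risk r_exhaust) or escapes detection (counted as system failure).
    """
    p_persist = 1.0 - p_transient
    return p_persist * (detectability * r_exhaust + (1.0 - detectability))


def break_even_detectability(p_transient, r_exhaust):
    """Detectability at which both policies carry the same risk in this toy model.

    Requires p_transient < 1 and r_exhaust < 1.
    """
    return (1.0 - r_exhaust / (1.0 - p_transient)) / (1.0 - r_exhaust)


if __name__ == "__main__":
    p_transient, r_exhaust = 0.9, 1e-3  # assumed values for illustration only
    for d in (0.95, 0.99, 0.999, 1.0):
        print(d, retry_risk(p_transient, d, r_exhaust), reconfigure_risk(r_exhaust))
    print("break-even:", break_even_detectability(p_transient, r_exhaust))
```

With these assumed values, attempting recovery carries the lower risk only when detectability exceeds roughly 0.991, which is consistent in spirit with the observation that the policy can hurt unless detectability is nearly perfect.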

A list of all papers and theses supported in part by this grant is attached.